<?xml version="1.0" encoding="iso-8859-1"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <link title="Purple" rel="stylesheet" href="manual-purple.css" type="text/css" />
  <title>V9938 VRAM timings</title>
</head>

<body>

<h1>V9938 VRAM timings</h1>

Measurements done by: Joost Yervante Damad, Alex Wulms, Wouter Vermaelen<br/>
Analysis done by: Wouter Vermaelen<br/>
Text written by: Wouter Vermaelen<br/>
with help from the rest of the openMSX team.

<h2>Introduction</h2>

<p>This text describes in detail how, when and why the V9938 reads from and
writes to VRAM in bitmap screen modes (screen 5, 6, 7 and 8). VRAM is accessed
for bitmap and sprite rendering but also for VDP command execution or by
CPU VRAM read/write requests.</p>

<h5 id="motivation">motivation</h5>
<p>Modern MSX emulators like blueMSX and openMSX are already fairly accurate.
And for most practical applications like games or even demos they are already
<i>good enough</i>. Though there are cases where you can still clearly see the
difference between a real and an emulated MSX machine.</p>

<p>For example the following pictures show the speed of the LINE command for
different slopes of the line. The first two pictures are generated on different
MSX emulators, the last picture is from a real MSX. Without going into all the
details: lines are drawn from the center of the image to each point at the
border. While the LINE commands are executing, the command color register is
rapidly changed (at a fixed rate). So faster varying colors indicate a slower
executing command.</p>

  <img src="line-speed-old-8.png" width="300" />
  <img src="line-speed-emu-8.png" width="300" />
  <img src="line-speed-real-8.png" width="300" />

<p>From left to right these pictures show:</p>
<ul>
<li>(left) The output of MSX emulators that use Alex Wulms' original command
engine emulation core. All(?) modern MSX emulators use this core, including
blueMSX, OCM and (older versions of) openMSX. The output consists of squares,
which indicates that the speed of a LINE command doesn't depend on the slope of the
line.</li>
<li>(center) The output of openMSX version 0.9.1. Here the command engine was
tweaked to take the slope of the line into account, so the test now generates
clean octagons.</li>
<li>(right) The output of a real MSX. The overall shape is also an octagon,
but there are a lot of irregularities. These irregularities can be
reproduced when running the test multiple times. So it must be a <i>real</i>
effect, and not some kind of measurement noise.</li>
</ul>

<p>This test is derived from NYRIKKI's test program described in this (long) <a
href="http://www.msx.org/forum/msx-talk/software-and-gaming/line">MRC forum
thread</a>. This particular test is not that important. But because it
generates a nice graphical output it allows us to show the problem without going
into too much technical detail (yet).</p>

<p>In most MSX applications these LINE speed differences, or small command
speed differences in general, likely won't cause any problems. (Except of
course in programs like this that specifically test for it.) But it would still
be nice to improve the emulators.</p>

<h5>measurements</h5>
<p>To be able to improve openMSX further we need to have a good understanding
of what it is exactly that causes these irregularities. It would be very hard
to figure out this stuff only by using MSX test programs. It might be easier to
look at the deeper hardware level. More specifically at the communication
between the VDP (V9938) and the VRAM chips. This should allow us to see when
exactly the VDP reads or writes which VRAM addresses.</p>

<p>So at the 2013 MSX fair in Nijmegen we (some members of the openMSX team and
I) connected a logic analyzer to the VDP-VRAM bus in a Philips NMS8250 machine.
The following picture gives an impression of our measurement setup.</p>
  <img src="v9938-probes.jpg" />

<p>Next we ran some MSX software that puts the VDP in a certain display mode.
It enables/disables screen and/or sprite rendering. And it optionally executes
VDP commands and/or accesses VRAM via the CPU. And while this test was running
we could capture (small chunks of) the communication between the VDP and the
VRAM. This gives us output (waveforms) like in the following image.</p>
  <img src="gtkwave.png" />

<p>It's not so easy to go from this waveform data to meaningful results about
how the VDP operates. This text also won't talk about this analysis process. If
you're interested in the analysis or in the raw measurement data, you can find
some more details in the <a
href="https://sourceforge.net/mailarchive/message.php?msg_id=30375119">
openmsx-devel mailing list archive</a>. The rest of this text will only discuss
the final results of the analysis.</p>

<p>Because one of the primary goals was to improve the command engine emulation
in openMSX, the measurements mostly focused on the bitmap screen modes (a V9938
doesn't allow commands in non-bitmap modes). So the following sections will
only occasionally mention text or character modes. Because we used a V9938 we
also couldn't test the YJK modes (screen 11 and 12). But it's highly likely
that, from a VRAM access point of view, these modes behave the same as screen 8
(or as we'll see later, the same as all the bitmap screen modes).</p>



<h2>VRAM accesses</h2>

<p>Before presenting the actual results of (the analysis of) the measurements,
this section first explains the general workings of the VDP-VRAM communication.
This is mostly a description of the functional interface of DRAM chips, but
then specifically applied to the VDP case. Feel free to skim (or even skip)
this section.</p>

<p>Like most RAM in MSX machines, the video RAM consists of DRAM chips. Many
variations of DRAM chips exist. You can find a whole lot of
information on <a
href="http://en.wikipedia.org/wiki/Dynamic_random-access_memory">
Wikipedia</a>. Most of the info in this section can also be found in the 'V9938
Technical Data Book'. Often that book goes into a lot more detail than this
text. Here I highlight (and simplify) the aspects that are relevant to
understand the later sections in this text.</p>


<h3>Connection between VDP and VRAM</h3>

<p>Between the VDP and the VRAM chips there is an 8-bit data bus. This means
that a single read or write access will transfer 1 byte of data.</p>

<p>There is also an 8-bit address bus. Obviously 8 bits are not enough to
address the full 128kB or even 192kB VRAM address space. Instead the address is
transferred in two steps. First the row-address is transferred followed by the
column address. (Usually) the row address corresponds to bits 15-8 of the full
address, while the column address corresponds to bits 7-0.</p>

<p>This still only allows addressing up to 64kB. To get to 128kB, there
are 2 separate column address strobe signals (named CAS0 and CAS1). These two
signals select one of the two available 64kB banks. So combined this
gives 128kB. (Usually) you can interpret CAS0/CAS1 as bit 16 of the
address.</p>
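
<p>As a rough illustration of the addressing scheme described above, the
following Python sketch splits a 17-bit address into the bank (CAS0 or CAS1),
the row address and the column address. The function name
<code>split_vram_address</code> is made up for this example, and the sketch
assumes the usual mapping (bank = bit 16, row = bits 15-8, column = bits
7-0).</p>

```python
# Illustrative helper; the name and the return layout are assumptions made
# for this example, they do not come from the V9938 Technical Data Book.
def split_vram_address(addr):
    """Split a 17-bit VRAM address into (bank, row, column)."""
    if addr not in range(0x20000):
        raise ValueError("address must fit in 17 bits")
    bank = (addr >> 16) & 1    # 0 selects CAS0, 1 selects CAS1
    row = (addr >> 8) & 0xFF   # transferred first, together with RAS
    column = addr & 0xFF       # transferred second, together with CAS
    return bank, row, column


bank, row, column = split_vram_address(0x1ABCD)
print("bank", bank, "row", hex(row), "column", hex(column))
```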

<p>In case of an MSX machine with 192kB VRAM there is also a third signal:
CASX. To simplify the rest of this text, this possibility is ignored. It
doesn't fundamentally change anything anyway.</p>

<p>Next to the data and address bus there are also some control signals. I've
already mentioned the CAS signals (used to select the column address). There's
a similar RAS (row address strobe) signal. And finally there's an R/W
(read-write) signal that indicates whether the access is a read or a write.</p>


<h3>Timing of the VDP-VRAM signals</h3>

<p>When the VDP wants to read or write a byte from/to VRAM it has to
<i>wiggle</i> the signals that connect the VDP to the VRAM in a certain way.
This section describes the timing of those <i>wiggles</i>.</p>

<p>The timing description in this section is different from the description in
the 'V9938 Technical Data Book'. The Data Book has the <i>real</i> timings,
including all the subtle details for how to build an actual working system.
This text has all the timings rounded to integer multiples of VDP clock cycles.
IMHO these simplified timings make the VDP-VRAM connection easier to understand
from a <i>functional</i> point of view.</p>


<h4>A single write</h4>

<p>To write a single byte to VRAM, follow this schema:</p>
<img src="dram-write.png" />
<ul>
 <li>Put the row address on the address bus and activate the RAS signal. Most
 signals are active-low, so activating means make the signal low.</li>
 <li>After one cycle (remember these are <i>functional</i> timings, especially
 in this step the <i>real</i> timing rules are more complex):
  <ul>
    <li>Activate (one of) the CAS signals.</li>
    <li>Put the column address on the address bus.</li>
    <li>Set the R/W signal. A low signal means write.</li>
    <li>Put the to-be-written data on the data bus.</li>
   </ul>
 </li>
 <li>After two cycles the CAS signal can be deactivated. At this point the
 value of the R/W signal doesn't matter anymore (it may have any value). But
 measurements show that the VDP restores the R/W signal to a high value at this
 point.</li>
 <li>Again one cycle later, the RAS signal can be deactivated.</li>
 <li>The RAS signal has to remain inactive for at least two cycles.</li>
</ul>

<p>So a full write cycle takes 6 VDP clock cycles.</p>


<h4>A single read</h4>

<p>Reads are very similar to writes, they follow this schema:</p>
<img src="dram-read.png" />
<ul>
 <li>Put the row address on the address bus and activate the RAS signal.</li>
 <li>After one cycle:
   <ul>
     <li>Activate (one of) the CAS signals.</li>
     <li>Put the column address on the address bus.</li>
     <li>Set the R/W signal: a high value indicates a read. The VDP keeps this
     signal high between VRAM transactions. So in measurements you don't
     actually see this signal changing for reads.</li>
   </ul>
 </li>
 <li>After two cycles the read data is available on the data bus. The CAS signal
   can be deactivated now.</li>
 <li>After one cycle the RAS signal can be deactivated.</li>
 <li>Wait at least two cycles before starting the next VRAM transaction.</li>
</ul>

<p>So this is very similar to a write: address selection is identical.
Obviously the R/W signal and the direction (and timing) of the information on
the data bus is different. And just like a write, a full read cycle also takes
6 VDP cycles.</p>
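
<p>The steps of a single write or read can also be written down as a small
list of events. The Python sketch below does that, using the simplified
<i>functional</i> timings from this section (cycle 0 is the moment RAS gets
activated). The function name and the event descriptions are only chosen for
this illustration.</p>

```python
def single_access_events(write):
    """Return (cycle, action) pairs for one single-byte VRAM access.

    These are the simplified functional timings from the text, rounded to
    whole VDP cycles. They are not the real electrical timings.
    """
    events = [
        (0, "RAS active, row address on bus"),
        (1, "CAS active, column address on bus"),
    ]
    if write:
        events.append((1, "R/W low, data to write on bus"))
        events.append((3, "CAS inactive, R/W back to high"))
    else:
        # R/W simply stays high for the whole read.
        events.append((3, "read data valid, CAS inactive"))
    events.append((4, "RAS inactive"))
    events.append((6, "earliest start of the next access"))
    return events


for cycle, action in single_access_events(write=True):
    print(cycle, action)
```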


<h4>Page mode reads (burst read)</h4>

<p>Often the VDP needs to read data from successive VRAM addresses. If those
addresses all have the same row address, then there's a faster way to perform
this compared to doing multiple reads like in the schema above.</p>

<img src="dram-read-burst.png" />
<ul>
 <li>Put the (common) row address on the address bus and activate the RAS
 signal.</li>
 <li>After one cycle:
   <ul>
     <li>Put the first column address on the address bus.</li>
     <li>Activate (one of) the CAS signals.</li>
     <li>Set the R/W signal (though the VDP already has this signal in the
     correct state).</li>
   </ul>
 </li>
 <li>After two cycles read the data from the data bus, and deactivate CAS.</li>
 <li>Two cycles later, put the 2nd column address on the address bus and
   re-activate (one of) the CAS signals.</li>
 <li>Again two cycles later read the data and deactivate CAS.</li>
 <li>It's possible to repeat this process for a 3rd, 4th, &hellip; byte.</li>
 <li>After one cycle deactivate the RAS signal.</li>
 <li>Wait at least two cycles before starting the next VRAM transaction.</li>
</ul>

<p>The above diagram shows a burst length of only two bytes. Longer bursts are
also possible. The VDP uses lengths up to 4 bytes (or 8, see next
section).</p>

<p>In this example reading two bytes takes 10 VDP cycles. Doing two single
reads would take 2&times;6=12 cycles. When doing longer bursts, the savings
become bigger. Doing a burst of N reads takes 2+4&times;N cycles compared to
6&times;N cycles for a sequence of single reads.</p>
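
<p>To make the arithmetic concrete, here is a small Python sketch of the two
cost formulas from the paragraph above. It also lists the cycles (relative to
the activation of RAS) at which CAS gets activated within one burst. The
function names are hypothetical, they are only used in this example.</p>

```python
def burst_read_cycles(n):
    """VDP cycles for a page-mode read of n bytes that share a row address."""
    return 2 + 4 * n


def single_read_cycles(n):
    """VDP cycles for n separate single-byte reads."""
    return 6 * n


def cas_activation_cycles(n):
    """Cycles, counted from RAS activation, at which CAS goes active."""
    return [1 + 4 * i for i in range(n)]


for n in (1, 2, 4):
    saved = single_read_cycles(n) - burst_read_cycles(n)
    print(n, "bytes:", burst_read_cycles(n), "cycles in a burst,",
          saved, "cycles saved, CAS active at", cas_activation_cycles(n))
```

<p>A burst of one byte costs the same as a single read, and every extra byte
in the burst saves two cycles.</p>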

<p>In principle it's also possible to do burst writes, but the VDP doesn't
use them (it never needs to write more than 1 byte in a sequence).</p>


<h4>Multi-bank page mode reads</h4>

<p>Burst reads are already faster than single-reads. But to be able to render
screen 7 and 8 images, burst reads are still not fast enough. In these two
screen modes, to be able to read the required data from VRAM fast enough, the
VDP reads from two banks in parallel.</p>

<img src="dram-read-burst-2banks.png" />

<p>There are 2 banks of 64kB. These two banks share the RAS control signal, but
they each have their own CAS signal. The address and data signals are also
shared. This allows reading from both banks <i>almost</i> in parallel:</p>
<ul>
 <li>In burst mode it was possible to read one byte every 4 VDP cycles. For
 this the CAS signal had to be alternately two cycles high and two cycles
 low. The address and data buses are only used during 1 of these 4 cycles.</li>
 <li>Multi-bank mode uses both the CAS0 and the CAS1 signals. CAS0 is high when
 CAS1 is low and vice versa. When looking at a single bank (which only sees one
 of the two CAS signals) this looks like a normal burst read. The only
 difference is that the RAS signal stays active 2 cycles longer than strictly
 needed, either at the start or at the end. But that's perfectly fine.</li>
</ul>

<p>So this schema gives (almost) double the VRAM-bandwidth. The only
requirement is that you alternately read from bank0 and bank1. At first sight
this requirement seems so strict that it is almost never possible to make use of
this banked reading mode: to render screen 7 or 8 you indeed need to read many
successive VRAM locations, not locations that alternately come from the
1st and 2nd 64kB bank.</p>

<p>To make it possible to use banked reading mode, the VDP interleaves the two
banks. This introduces the concept of <i>logical</i> and <i>physical</i>
addresses:</p>
<ul>
 <li><i>Logical</i> addresses are the addresses that a programmer of the VDP
 normally uses. For example the bitmap data for screen 8 (possibly) starts at
 address 0x00000 and runs up to (but not including) address 0x0D400.</li>
 <li><i>Physical</i> addresses are the addresses that actually appear on the
 signals between the VDP and the VRAM. So the combination of the row and column
 address and the CAS0 or CAS1 bank-selection.</li>
</ul>

<p>In most screen modes the logical and the physical addresses are the same.
But in screen 7 and 8 there's a transformation between the two:</p>
<p align="center">physical = (logical &gt;&gt; 1) | ((logical &amp; 1) &lt;&lt; 16)</p>
<p>So the 17-bit logical address is rotated one bit to the right to get the
physical address. The effect of this transformation is that all even logical
addresses end up in physical bank0 while all odd logical addresses end up in
physical bank1. So now when you read from successive logical addresses you read
from alternating physical banks and thus it is possible to use banked read
mode.</p>
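
<p>The rotation described above can be sketched in a few lines of Python. The
function names and the <code>interleaved</code> parameter are only chosen for
this illustration. The inverse function is not something we measured, it just
follows from undoing the rotation.</p>

```python
ADDR_MASK = 0x1FFFF  # 17 address bits, so 128kB


def logical_to_physical(logical, interleaved):
    """Map a logical VRAM address to a physical one.

    Pass interleaved=True for screen 7 and 8, False for the other modes.
    """
    logical &= ADDR_MASK
    if not interleaved:
        return logical
    # Rotate the 17-bit value one bit to the right:
    # bit 0 of the logical address becomes bit 16 (the bank selection).
    return (logical >> 1) | ((logical & 1) << 16)


def physical_to_logical(physical):
    """Undo the screen 7/8 mapping (rotate one bit to the left)."""
    physical &= ADDR_MASK
    return ((physical << 1) & ADDR_MASK) | (physical >> 16)


for logical in range(4):
    physical = logical_to_physical(logical, interleaved=True)
    print(hex(logical), "->", hex(physical), "bank", physical >> 16)
```

<p>Successive logical addresses indeed end up alternately in bank0 and
bank1.</p>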

<p>Usually a VDP programmer doesn't need to be aware of this interleaving. But
because interleaving is only enabled in screen 7 and 8, this effect can become
visible when switching between screen modes. <i>An alternative design decision
could have been to always interleave the addresses. I guess the V9938 designers
didn't make this choice to allow for single chip configurations in case only
64kB VRAM is connected.</i></p>

<p>The diagram above shows a read of 2&times;2 bytes, in reality the VDP only
uses this schema to read 2&times;4 bytes. In principle it's also possible to
write to two banks in parallel, but the VDP never needs this ability.</p>

<h4>Refresh</h4>

<p>DRAM chips need to be refreshed regularly. The VDP is responsible for doing
this (there are DRAM chips that handle refresh internally, but the VDP doesn't
use such chips). Many DRAM chips allow a refresh by only activating and
deactivating the RAS signal, so without actually performing a read or write in
between. When extrapolating from the above timing diagrams, this would only
cost 4 cycles. Though the VDP doesn't actually use this RAS-without-CAS refresh
mode. Instead it performs a regular read access which takes 6 cycles.</p>

<p>Each time a read (or write) is performed on a certain row of a DRAM chip,
that whole row is refreshed. So to refresh the whole RAM, the VDP has to
periodically read (any column address of) each of the 256 possible rows.</p>



<h2>Distribution of VRAM accesses</h2>

<p>The previous section described the details of isolated (single or burst)
VRAM accesses. This section will look at such accesses as indivisible units and
examine how these units are grouped together and spread in time to perform all
the VRAM related stuff the VDP has to do.</p>

<p>The VDP can perform VRAM reads/writes for the following reasons:</p>
<ul>
 <li>Refresh</li>
 <li>Bitmap rendering</li>
 <li>Sprite rendering</li>
 <li>CPU read/write</li>
 <li>Command read/write</li>
</ul>
<p>Note that next to bitmap modes, the VDP also has character and text modes. I
didn't investigate those modes yet, so this text mostly ignores them.</p>

<p>The rest of this text explains when in time (at which specific VDP
cycles) accesses of each type are executed.</p>

<p>We'll first focus on refresh and bitmap/sprite rendering. Later we'll add
CPU and command engine. The reason for this split is that the first group has a
fairly simple pattern: refreshes always occur at fixed moments in time.
Enabling bitmap rendering only adds additional VRAM reads but has no influence
on the timing of the refreshes. Similarly enabling sprite rendering adds even
more reads without influencing the bitmap or refresh reads. CPU and command
accesses on the other hand cannot simply be added to this schema without
influencing each other. So those are postponed till a later section.</p>

<h3>Horizontal line timing</h3>

<p>The VDP renders a full frame line-by-line. For each line the VDP (possibly)
has to read some bitmap and sprite data from VRAM. It's logical to assume (and
the measurements confirm this) that the data fetches within one line occur at
the same relative positions as the corresponding data fetches of another line.
So if we can figure out the details for one line, we can extrapolate this to a
whole frame. Similarly we can assume that different frames will have similar
relative timings. So really all we need to know is the timing of one line.</p>

<p><i>TODO: odd and even frames in interlace mode probably do have timing
differences. Still need to investigate this.</i>
</p>

<p>Let's thus first look at what we already know about a horizontal display
line. The 'V9938 Technical Data Book' contains the following timing info about
(non-text mode) display lines.</p>

<table>
<tr><th>Description       </th><th>Cycles       </th><th>Length</th></tr>
<tr><td>Synchronize signal</td><td>[0    -  100)</td><td> 100</td></tr>
<tr><td>Left erase time   </td><td>[100  -  202)</td><td> 102</td></tr>
<tr><td>Left border       </td><td>[202  -  258)</td><td>  56</td></tr>
<tr><td>Display cycle     </td><td>[258  - 1282)</td><td>1024</td></tr>
<tr><td>Right border      </td><td>[1282 - 1341)</td><td>  59</td></tr>
<tr><td>Right erase time  </td><td>[1341 - 1368)</td><td>  27</td></tr>
<tr><td>Total             </td><td>[0    - 1368)</td><td>1368</td></tr>
</table>

<p>So one display line is divided in 6 periods. The total length of one line is
1368 cycles. The previous section showed how long individual VRAM accesses
take. The next sections will figure out how all the required accesses fit in
this per-line budget of 1368 cycles.</p>

<p>A note about the timing notation: in this text all the timing numbers are
VDP cycles relative within one line. For example in the table above the display
period starts at cycle 258. The display period of the next line will start at
cycle 258+1368=1626, the next at cycle 2994 and so on. To make the values
smaller, all cycle numbers will be folded to the interval [0, 1368). The
starting point (cycle=0) has no special meaning. We could have taken any other
point and called that the starting point. (For the current choice, the external
VDP HSYNC pin gets activated at cycle=0, so it was a convenient point to
synchronize the measurements on).</p>
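
<p>As an illustration of this notation, the following Python sketch folds an
arbitrary cycle count into the interval [0, 1368) and looks up the region of
the line it falls in. The region boundaries are copied from the table above
(so they assume set-adjust=0 and a line length of 1368 cycles). The names
<code>fold</code> and <code>region_at</code> are hypothetical.</p>

```python
LINE_LENGTH = 1368

# (name, first cycle) pairs copied from the table above. Each region runs
# up to the start of the next one, the last one up to LINE_LENGTH.
REGIONS = [
    ("synchronize signal", 0),
    ("left erase time", 100),
    ("left border", 202),
    ("display cycle", 258),
    ("right border", 1282),
    ("right erase time", 1341),
]


def fold(cycle):
    """Fold an absolute cycle count into the interval [0, LINE_LENGTH)."""
    return cycle % LINE_LENGTH


def region_at(cycle):
    """Return the name of the region that contains the given cycle."""
    pos = fold(cycle)
    name = REGIONS[0][0]
    for region_name, start in REGIONS:
        if pos >= start:
            name = region_name
    return name


for cycle in (258, 1626, 2994):
    print(cycle, fold(cycle), region_at(cycle))
```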

<p><i>TODO horizontal set-adjust: The numbers in the above table are valid for
horizontal set-adjust=0. Similarly all our measurements were done with
set-adjust=0. Using different set-adjust values will make the left/right border
bigger/smaller. I still need to figure out which timing values of the next
sections are changed by this. E.g. are all the VRAM accesses in a line shifted
as a whole, or are just the bitmap data fetches shifted while (some) other
accesses remain fixed?</i></p>

<p><i>TODO bits S1,S0 in VDP register R#9: The above table is valid for
S1,S0=0,0. In other cases the length of a display line is only 1365 cycles
instead of 1368. The rest of this text assumes a line length of 1368 cycles. I
still need to figure out where exactly in the line this difference of 3 cycles
is located.</i></p>

<!-- numbers for 1365 cycles
[0   - 100) (len= 100)
[100 - 202) (len= 102)
[202 - 258) (len=  56)
[258 -1282) (len=1024)
[1282-1339) (len=  57)
[1339-1365) (len=  26)-->

<h3>Sneak preview</h3>

<p>The following image graphically summarizes the results of the rest of this
section. This is a very wide image, it is much larger than what can be shown
inline in this text (click to see the full image). It's highly recommended to
open this image in an external image viewer that allows to easily zoom in and
out and scroll the image.</p>

<a href="vdp-timing.png">
<img src="vdp-timing.png" width="1200" />
</a>

<p>Here's an overview of the most important items in this image:</p>
<ul>
<li>Horizontally there are 6 regions in the image (each has a slightly
different background color). These regions correspond to the 'synchronize',
'left/right erase', 'left/right border' and 'display' regions in the table from
the previous section.</li>
<li>Horizontally you also see a timeline going from 0 to 1368 cycles. This
corresponds to one full display line.</li>
<li>Vertically there are 3 big groups: 'screen off', 'no sprites' and
'sprites', see next section for why these groups are important.</li>
<li>Within one vertical group there is one color-coded band and a set of
RAS/CAS signals. Usually there's one RAS and 2 CAS signals, but the 'sprites
off' group has 2 pairs of CAS signals. For the 'sprites off' and 'sprites on'
groups there are subtle differences in the CAS0/1 signals between screen modes
5/6 and 7/8. But to save space these differences are only shown once.</li>
<li>The colors in the color-coded band have the following meaning:</li>
 <ul>
  <li>red: refresh read</li>
  <li>green: bitmap data read (dark-green is dummy bitmap read)</li>
  <li>yellow: sprite data read (brown is dummy sprite read)</li>
  <li>blue: potential CPU or command engine read or write</li>
  <li>dark-grey: dummy read</li>
  <li>light-grey: idle (no read or write)</li>
 </ul>
<li>The CAS signals are drawn in either a full or a stippled line. Full means
the signal is definitely high/low at this point. Stippled means it can be high
or low depending on whether there was a CPU request or VDP command executing at
that point. Note that the RAS signal always toggles, even if there is no CPU or
command access required.</li>
</ul>

<p>The next sections will go into a lot more detail. It's probably a good idea
to have this (zoomed in) image open while reading those later sections.</p>


<h3>3 operating modes</h3>

<p>When looking from a VDP-VRAM interaction point of view, the VDP can operate
in 3 modes:</p>
<ul>
 <li>Screen disabled (sprite status doesn't matter). This is the same as
     vertical border.</li>
 <li>Screen enabled, sprites disabled.</li>
 <li>Screen enabled, sprites enabled.</li>
</ul>

<p>Note that the (bitmap) screen mode (screen 5, 6, 7, or 8) largely doesn't
matter for the VRAM access pattern.</p>

<p><i>TODO sprite fetching happens 1 line earlier than displaying those sprites
(see below for details). This means that the last line of the vertical border
before the display area likely uses a 'mixed mode' where it doesn't yet fetch
bitmap data but it does already fetch sprite data. I didn't specifically
measure this condition, so I can't really tell anything about this mixed mode.
(One possibility is that it's just like a normal display line, but the fetched
bitmap data is ignored.) Similarly the last line of the display area doesn't
strictly need to fetch new sprite data.</i></p>

<p>We'll now look at these 3 modes in more detail.</p>


<h4>Screen disabled</h4>

<h5>refresh</h5>
<p>Screen rendering can be disabled via bit 6 in VDP register R#1. There's also
no screen rendering when the VDP is showing a vertical border line. From a
VRAM-access point of view both cases are identical.</p>

<p>In this mode the VDP doesn't need to fetch any data from VRAM for
rendering. It only needs to refresh the VRAM. As already mentioned earlier,
the VDP uses a regular read to refresh the RAM, so this takes 6 cycles.</p>

<p>The VDP executes 8 refresh actions per display line. They start at the
following moments in time (the red blocks in the big timing diagram):</p>
<table>
<tr><td>284</td><td>412</td><td>540</td><td>668</td>
    <td>796</td><td>924</td><td>1052</td><td>1180</td></tr>
</table>
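
<p>These 8 start moments are spaced 128 cycles apart. A minimal Python sketch
that generates them, and that checks whether a given cycle falls inside a
refresh access, could look like this. It assumes the 1368-cycle line length
and the 6-cycle duration of a refresh read mentioned above. The names are made
up for this example.</p>

```python
LINE_LENGTH = 1368
REFRESH_LENGTH = 6  # a refresh is a regular read, so 6 cycles

# 8 refreshes per line, the first at cycle 284, then one every 128 cycles.
REFRESH_STARTS = [284 + 128 * i for i in range(8)]


def in_refresh(cycle):
    """Return True if the (folded) cycle lies inside a refresh access."""
    pos = cycle % LINE_LENGTH
    return any(pos - start in range(REFRESH_LENGTH) for start in REFRESH_STARTS)


print(REFRESH_STARTS)
print(in_refresh(284), in_refresh(290))
```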

<h5>refresh-addresses</h5>
<p><i>I didn't investigate this refresh-address-stuff in detail because it
doesn't matter for emulation accuracy</i>.</p>

<p>The logical addresses used for refresh reads seem to be of the form:</p>
<p align="center">N&times;0x10101 | 0x3F</p>
<p>Where N increases on each refresh action. So on each refresh the row address
increases by one, and successive refreshes alternate between the CAS0 and the
CAS1 signal (the column address doesn't matter for refresh). Note that this
formula is for the logical address, in screen 7/8 this still gets transformed
to a physical address. So in screen 7/8 a refresh action always uses the CAS1
signal. That means that in screen 7/8 the DRAM chip(s) of bank0 actually do get
refreshed using the RAS-without-CAS refresh mode.</p>
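
<p>The following Python sketch evaluates this formula for the first few
refreshes and shows the resulting bank and row, both without and with the
screen 7/8 address transformation. Treating N as an 8-bit counter is an
assumption on my part (we only observed that N increases), and all function
names are hypothetical.</p>

```python
ADDR_MASK = 0x1FFFF  # 17 address bits


def refresh_logical_address(n):
    """Logical refresh address for refresh number n."""
    n &= 0xFF  # assumption: N behaves like an 8-bit counter
    return ((n * 0x10101) | 0x3F) & ADDR_MASK


def to_physical(logical, interleaved):
    """Apply the screen 7/8 rotation when interleaved is True."""
    if not interleaved:
        return logical
    return (logical >> 1) | ((logical & 1) << 16)


def bank_and_row(physical):
    """Return (bank, row address) of a physical address."""
    return physical >> 16, (physical >> 8) & 0xFF


for n in range(4):
    logical = refresh_logical_address(n)
    print(n, hex(logical),
          "screen 5/6:", bank_and_row(to_physical(logical, False)),
          "screen 7/8:", bank_and_row(to_physical(logical, True)))
```

<p>Because bit 0 of the logical address is always set, the transformed
address always selects bank1 (CAS1) in screen 7/8.</p>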

<p>The refresh timings are the same for all non-text screen modes. But in text
modes there are only 7 refreshes per line and they are also located at
different relative positions than in the table above. I didn't investigate
this further.</p>


<h5>dummy reads</h5>
<p>Next to the refresh reads, in 'screen disabled' mode, the VDP still performs
4 reads of address 0x1FFFF. At the following moments (marked with dark-grey
blocks on the timeline):</p>
<table><tr><td>1236</td><td>1244</td><td>1252</td><td>1260</td></tr></table>

<p>I can't imagine any use for these reads, so let's call them dummy reads. In all
our measurements these dummy reads always re-occur in these same positions, so
it's not a fluke in (only one of) the measurements.</p>

<p>The refresh actions remain exactly the same in the other two modes. But
these dummy reads are different in the mode 'sprites off' or disappear
completely in the mode 'sprites on'. (This confirms that nothing 'useful' is
done by these dummy reads).</p>

<p>Anyway for emulation we can mostly ignore these dummy reads. It only matters
that at these moments in time there cannot be CPU or command VRAM reads or
writes.</p>


<h4>screen enabled, sprites disabled</h4>

<h5>refresh and dummy reads</h5>

<p>Refresh works exactly the same as in the previous mode. The dummy reads
are a bit different. Now there are only 3 dummy reads at slightly different
moments (also shown in dark-grey):</p>
<table><tr><td>1242</td><td>1250</td><td>1258</td></tr></table>

<p>The first of these 3 reads is always from address 0x1FFFF. The second and
third dummy read have a pattern in their address. For example:</p>
<table>
<tr><th>1st</th><th>2nd</th><th>3rd</th></tr>
<tr><td>0x1FFFF</td><td>0x03B80</td><td>0x03B82</td></tr>
<tr><td>0x1FFFF</td><td>0x03C00</td><td>0x03C02</td></tr>
<tr><td>0x1FFFF</td><td>0x03C80</td><td>0x03C82</td></tr>
<tr><td>0x1FFFF</td><td>0x03D00</td><td>0x03D02</td></tr>
</table>
<p>This table shows the addresses of the 3 dummy reads for 4 successive display
lines (this is data from an actual measurement, unfortunately our equipment
could only buffer up to 4 lines). The lower 7 bits of the address of the 2nd
read always seem to be zero. The address of the 3rd read is the same as for the
2nd read except that bit 1 is set to 1. When going from one line to the next,
the address increases by 0x80. Our measurements captured 10 independent sets of
4 successive lines. Each time bits 16-15 were zero (bits 14-7 do take different
values). This could be a coincidence, or it could be that these bits really
aren't included in the counter. Note that again these are logical addresses (so
still transformed for screen 7/8). I didn't investigate these dummy reads in
more detail because they mostly don't matter for emulation.</p>
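
<p>The observed pattern can be summarized in a few lines of Python. This is
only a model that reproduces the addresses in the table above. The idea of an
8-bit per-line counter that supplies bits 14-7 is my guess, and the function
name is made up.</p>

```python
def dummy_read_addresses(counter):
    """Return the logical addresses of the 3 dummy reads of one line.

    counter is a hypothetical 8-bit value that increases by one per line
    and supplies bits 14-7 of the 2nd and 3rd address.
    """
    second = (counter & 0xFF) << 7  # lower 7 bits are always zero
    third = second | 0x02           # same address, but with bit 1 set
    return 0x1FFFF, second, third


# Reproduce the 4 successive lines of the measurement shown above.
for counter in range(0x77, 0x7B):
    print([hex(a) for a in dummy_read_addresses(counter)])
```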


<h5>bitmap reads</h5>
<p>The major change compared to the previous mode is that now the VDP needs to
fetch extra data for the bitmap rendering. These fetches happen in 32 blocks of
4 bytes (screen 5/6) or 8 bytes (screen 7/8). The fetches within one block
happen in burst mode. This means that one block takes 18 cycles (screen 5/6) or
20 cycles (screen 7/8). Though later we'll see that the two spare cycles for
screen 5/6 are not used for anything else, so for simplicity we can say that in
all bitmap modes a bitmap-fetch-block takes 20 cycles. This is even clearer if
you look at the RAS signal: this signal follows the exact same pattern in all
(bitmap) screen modes, so in screen 5/6 it remains active for two cycles longer
than strictly necessary.</p>

<p>Actually before these 32 blocks there's one extra dummy block. This block
has the same timing as the other blocks, but it always reads address 0x1FFFF.
From an emulator point of view, these dummy reads don't matter, it only matters
that at those moments no other VRAM accesses can occur.</p>

<p>The start of these 1+32 blocks are located at these moments in time (these
are the green blocks in the big timing diagram):</p>
<table>
<tr><td>(195)</td><td> 227</td><td> 259</td><td> 291</td><td> 323</td>
                  <td> 355</td><td> 387</td><td> 419</td><td> 451</td></tr>
<tr><td>     </td><td> 483</td><td> 515</td><td> 547</td><td> 579</td>
                  <td> 611</td><td> 643</td><td> 675</td><td> 707</td></tr>
<tr><td>     </td><td> 739</td><td> 771</td><td> 803</td><td> 835</td>
                  <td> 867</td><td> 899</td><td> 931</td><td> 963</td></tr>
<tr><td>     </td><td> 995</td><td>1027</td><td>1059</td><td>1091</td>
                  <td>1123</td><td>1155</td><td>1187</td><td>1219</td></tr>
</table>
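
<p>The table above is simply a sequence that starts at cycle 195 and advances
by 32 cycles per block. The Python sketch below generates it, and also checks
that 32 blocks of 4 or 8 bytes give the expected number of bitmap bytes per
display line. The names are hypothetical, the numbers come from the text
above.</p>

```python
BLOCK_LENGTH = 20  # cycles reserved per block in all bitmap modes

# 1 dummy block followed by 32 real blocks, one every 32 cycles.
BLOCK_STARTS = [195 + 32 * i for i in range(1 + 32)]


def bytes_per_block(screen_mode):
    """Bytes fetched per block: 8 in screen 7/8, 4 in screen 5/6."""
    return 8 if screen_mode in (7, 8) else 4


def bitmap_bytes_per_line(screen_mode):
    """Bitmap bytes fetched per display line (dummy block not counted)."""
    return 32 * bytes_per_block(screen_mode)


print(BLOCK_STARTS[:3], "...", BLOCK_STARTS[-1])
for mode in (5, 6, 7, 8):
    print("screen", mode, bitmap_bytes_per_line(mode), "bytes per line")
```

<p>So screen 5 and 6 fetch 128 bytes per line and screen 7 and 8 fetch 256
bytes per line, which matches the size of one line of bitmap data in those
modes.</p>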

<p><i>The following is only speculation: I wonder why there is such a dummy
preamble block. Theoretically this <b>could</b> have been used (or reserved) to
implement V9958-like horizontal scrolling without having to mask 8 border
pixels. Unfortunately horizontal scrolling on a V9958 doesn't work like that
:(</i></p>

<h4>screen enabled, sprites enabled</h4>

<h5>refresh, dummy reads, bitmap reads</h5>
<p>Refresh and bitmap reads are exactly the same as in the previous mode. But
the 3 or 4 dummy reads from the previous 2 modes are not present in this
mode.</p>

<h5>sprite reads</h5>
<p><i>I've only investigated bitmap modes, that means the stuff below applies
only to sprite mode 2.</i></p>

<p>For sprite rendering you need to:</p>
<ul>
 <li>Figure out which sprites are visible: There are 32 positions in the
   sprite attribute table, and of those at most 8 sprites can be visible
   (per line).</li>
 <li>For the visible sprites, fetch the required data so that it can actually
 be drawn. This data is: the x- and y-coordinates, the sprite pattern number,
 the pattern data and the color data.</li>
</ul>

<p>Figuring out which sprites are visible is done by reading the y-coordinate
of each of the 32 possible sprites. These reads are interleaved with the
32 block-reads of the bitmap data: one byte is read between consecutive bitmap
blocks. Because of this interleaving it's not possible to use burst mode, so
each read takes 6 cycles. There's also 1 dummy read of address 0x1FFFF at the end. The
reads happen at these moments in time (yellow blocks between the green blocks in
the diagram):</p>
<table>
<tr><td> 182</td><td> 214</td><td> 246</td><td> 278</td>
    <td> 310</td><td> 342</td><td> 374</td><td> 406</td></tr>
<tr><td> 438</td><td> 470</td><td> 502</td><td> 534</td>
    <td> 566</td><td> 598</td><td> 630</td><td> 662</td></tr>
<tr><td> 694</td><td> 726</td><td> 758</td><td> 790</td>
    <td> 822</td><td> 854</td><td> 886</td><td> 918</td></tr>
<tr><td> 950</td><td> 982</td><td>1014</td><td>1046</td>
    <td>1078</td><td>1110</td><td>1142</td><td>1174</td><td>(1206)</td></tr>
</table>
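<p>These y-coordinate reads also follow a fixed 32-cycle pattern. The sketch
below is only an illustration of that pattern; the function names are
hypothetical. It maps an attribute-table index to the start of its read and
tells whether a given cycle falls inside one of the 32+1 reads (assuming each
read occupies exactly the 6 cycles mentioned above):</p>

```python
SPRITE_Y_FIRST = 182     # start cycle of the y-coordinate read for entry 0
SPRITE_Y_PERIOD = 32     # one read between consecutive bitmap blocks
SINGLE_READ_CYCLES = 6   # duration of a single (non-burst) read
NUM_READS = 32 + 1       # 32 real reads plus the trailing dummy read


def sprite_y_read_start(index):
    """Start cycle of the y-read for attribute-table entry `index` (0..31).

    index == 32 is the trailing dummy read of address 0x1FFFF.
    """
    if not 0 <= index <= 32:
        raise ValueError("index must be in 0..32")
    return SPRITE_Y_FIRST + SPRITE_Y_PERIOD * index


def vram_busy_with_y_read(cycle):
    """True if `cycle` falls inside one of the 32+1 y-coordinate reads."""
    offset = cycle - SPRITE_Y_FIRST
    if offset < 0 or offset >= NUM_READS * SPRITE_Y_PERIOD:
        return False
    return offset % SPRITE_Y_PERIOD < SINGLE_READ_CYCLES
```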

<p>In the worst case, the 8 last sprites of the attribute table are visible. In
that case all 32 reads are really required. Though even if the limit of 8
visible sprites is reached earlier, the VDP continues fetching all 32+1 bytes.
Also if one y-coordinate is equal to 216 (meaning that all later sprites are
invisible), still all 32+1 fetches are executed.</p>

<p>Once the VDP has figured out which sprites are visible it needs to fetch the
data to actually draw the sprites. This VRAM access pattern is relatively
complex:</p>
<ul>
<li>In the worst case there are 8 visible sprites. This requires reading
8&times;6 bytes. Some of these reads can be done in burst mode, others are
single byte reads.</li>
<li>Even if there are fewer than 8 sprites to display, all read accesses
still occur. It <i>seems</i> that the useless reads are duplicates of
sprite 0. (Or is it the first visible sprite? I didn't look in detail because
it's not important for our purpose. It only matters that the VRAM bus remains
occupied.)</li>
<li>The data fetches happen in 4 chunks of 2 sprites each. Each chunk
reads:
 <ul>
  <li>Y-coordinate, x-coordinate and pattern-number of 1st sprite. Burst of 3
  reads, takes 13(!) cycles.</li>
  <li>Y-coordinate, x-coordinate and pattern-number of 2nd sprite. Burst of 3
  reads, takes 13(!) cycles.</li>
  <li>Pause of 6 or 10(!) cycles</li>
  <li>2 pattern bytes of 1st sprite. Burst of 2 reads, takes 10 cycles.</li>
  <li>Color attribute of 1st sprite. Single read, takes 6 cycles.</li>
  <li>2 pattern bytes of 2nd sprite. Burst of 2 reads, takes 10 cycles.</li>
  <li>Color attribute of 2nd sprite. Single read, takes 6 cycles.</li>
 </ul></li>
<li>Note that the burst of 3 reads only takes 13 instead of the expected 14
cycles. If you look at the RAS/CAS signals you see that this uses an illegal(?)
RAM access pattern: RAS is released together with CAS (even slightly before if
you look at the raw measured data). But apparently this works fine
<i>&hellip; makes me wonder why the VDP doesn't always use this faster
access pattern.</i></li>
<li>Even for 8x8 sprites, the VDP always fetches 2 bytes of pattern-data per
sprite line (and the 2nd byte is ignored).</li>
<li>Note that the y-coordinate is fetched again. It was already fetched to
figure out which sprites are visible.</li>
<li>The positions in time of these reads (single or burst) are like this
(yellow blocks (mostly) in the border period in the big timing diagram):
<table>
<tr><td>1238</td><td>1251</td><td>1270</td><td>1280</td><td>1286</td><td>1296</td></tr>
<tr><td>1302</td><td>1315</td><td>1338</td><td>1348</td><td>1354</td><td>1364</td></tr>
<tr><td>   2</td><td>  15</td><td>  34</td><td>  44</td><td>  50</td><td>  60</td></tr>
<tr><td>  66</td><td>  79</td><td>  98</td><td> 108</td><td> 114</td><td> 124</td></tr>
</table>
Note that some of these fetches occur in the previous and some in the current
display line. Though the start of the display line was chosen arbitrarily (we
could have picked the starting point so that these numbers don't wrap). It only
matters that all sprite data is fetched before the display rendering
starts.</li>
<li>Also note that the timing is slightly irregular: in the 1st, 3rd and 4th
group there's a pause of 6 cycles; exactly one other access fits in this
gap. But in the 2nd group there's a pause of 10 cycles. Still only one other
access fits in this gap, and the timing is 2+6+2, so 2 'wasted' cycles
before and after that other access. <i>I suspect that these 2+2 cycles are
related to the R#9 S1,S0 bits. TODO measure this</i>.</li>
</ul>
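<p>The chunk structure described above is enough to reproduce the table of
fetch times. The following sketch does that. It assumes a display line of 1368
VDP cycles (so that times wrap into the next line), and the names
<code>FIRST_FETCH</code>, <code>PAUSES</code> and
<code>sprite_data_fetch_starts</code> are hypothetical, chosen only for this
example:</p>

```python
LINE_CYCLES = 1368       # assumed length of one display line in VDP cycles
FIRST_FETCH = 1238       # start of the first chunk (taken from the table)
PAUSES = [6, 10, 6, 6]   # pause inside each of the 4 chunks


def sprite_data_fetch_starts():
    """Return 4 rows (one per chunk) with the start cycle of each read."""
    rows = []
    t = FIRST_FETCH
    for pause in PAUSES:
        steps = [
            ("read", 13),      # y, x, pattern number of 1st sprite (burst)
            ("read", 13),      # y, x, pattern number of 2nd sprite (burst)
            ("pause", pause),  # room for one CPU/command access
            ("read", 10),      # 2 pattern bytes of 1st sprite (burst)
            ("read", 6),       # color attribute of 1st sprite
            ("read", 10),      # 2 pattern bytes of 2nd sprite (burst)
            ("read", 6),       # color attribute of 2nd sprite
        ]
        row = []
        for kind, duration in steps:
            if kind == "read":
                row.append(t % LINE_CYCLES)
            t += duration
        rows.append(row)
    return rows


for row in sprite_data_fetch_starts():
    print(row)
```

<p>A chunk with a 6-cycle pause is 64 cycles long, the chunk with the 10-cycle
pause is 68 cycles long. That is why the 3rd chunk starts at cycle 2 of the
next line.</p>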

<p>It's worth repeating that whenever sprites are enabled, the VDP
<b>always</b> performs the same fetch-pattern: even if no sprites are
actually visible, if sprites are partially disabled (with y=216), with
8x8 or 16x16 sprites, magnified or not. This is consistent with the observation
that the VDP command engine is slowed down by the exact same amount in all these
situations. Also all (bitmap) screen modes behave exactly the same with respect
to sprite data fetches.</p>



<h3>CPU and command reads/writes</h3>

<h5>position of access slots</h5>
<p>The previous sections explained when the VDP reads from VRAM for refresh and
bitmap/sprite rendering (and even some dummy reads). Depending on the mode
(screen/sprites enabled/disabled), this takes more or less of the available
VRAM-bandwidth. The portion of the VRAM bandwidth that is not used for
rendering can be used for CPU or command engine VRAM reads or writes.</p>

<p>All CPU and command engine accesses are single (non-burst) accesses, so they
take 6 cycles each. However it is <b>not</b> the case that whenever the VRAM
bus is idle for 6 cycles, it can be used for CPU or command engine
accesses.</p>

<p>Instead there are fixed moments in time where a CPU or command access could
<i>possibly</i> start; let's call these moments access slots. Each slot
can be used for either CPU or command accesses (there are no slots that are
uniquely reserved for either CPU or for commands). The position and the number
of access slots depend <i>only</i> on the VDP mode (screen off, sprites off,
sprites on), not for example on the number of actually visible sprites or on
the (bitmap) screen mode.</p>

<p>The 3 tables below show the amount and the positions of the possible access
slots for the 3 different modes (in the timing diagram these are the blue
blocks):</p>

<p><table>
<caption>screen off, 154 possible slots</caption>
<tr><td>   0</td><td>   8</td><td>  16</td><td>  24</td><td>  32</td>
    <td>  40</td><td>  48</td><td>  56</td><td>  64</td><td>  72</td></tr>
<tr><td>  80</td><td>  88</td><td>  96</td><td> 104</td><td> 112</td>
    <td> 120</td><td> 164</td><td> 172</td><td> 180</td><td> 188</td></tr>
<tr><td> 196</td><td> 204</td><td> 212</td><td> 220</td><td> 228</td>
    <td> 236</td><td> 244</td><td> 252</td><td> 260</td><td> 268</td></tr>
<tr><td> 276</td><td> 292</td><td> 300</td><td> 308</td><td> 316</td>
    <td> 324</td><td> 332</td><td> 340</td><td> 348</td><td> 356</td></tr>
<tr><td> 364</td><td> 372</td><td> 380</td><td> 388</td><td> 396</td>
    <td> 404</td><td> 420</td><td> 428</td><td> 436</td><td> 444</td></tr>
<tr><td> 452</td><td> 460</td><td> 468</td><td> 476</td><td> 484</td>
    <td> 492</td><td> 500</td><td> 508</td><td> 516</td><td> 524</td></tr>
<tr><td> 532</td><td> 548</td><td> 556</td><td> 564</td><td> 572</td>
    <td> 580</td><td> 588</td><td> 596</td><td> 604</td><td> 612</td></tr>
<tr><td> 620</td><td> 628</td><td> 636</td><td> 644</td><td> 652</td>
    <td> 660</td><td> 676</td><td> 684</td><td> 692</td><td> 700</td></tr>
<tr><td> 708</td><td> 716</td><td> 724</td><td> 732</td><td> 740</td>
    <td> 748</td><td> 756</td><td> 764</td><td> 772</td><td> 780</td></tr>
<tr><td> 788</td><td> 804</td><td> 812</td><td> 820</td><td> 828</td>
    <td> 836</td><td> 844</td><td> 852</td><td> 860</td><td> 868</td></tr>
<tr><td> 876</td><td> 884</td><td> 892</td><td> 900</td><td> 908</td>
    <td> 916</td><td> 932</td><td> 940</td><td> 948</td><td> 956</td></tr>
<tr><td> 964</td><td> 972</td><td> 980</td><td> 988</td><td> 996</td>
    <td>1004</td><td>1012</td><td>1020</td><td>1028</td><td>1036</td></tr>
<tr><td>1044</td><td>1060</td><td>1068</td><td>1076</td><td>1084</td>
    <td>1092</td><td>1100</td><td>1108</td><td>1116</td><td>1124</td></tr>
<tr><td>1132</td><td>1140</td><td>1148</td><td>1156</td><td>1164</td>
    <td>1172</td><td>1188</td><td>1196</td><td>1204</td><td>1212</td></tr>
<tr><td>1220</td><td>1228</td><td>1268</td><td>1276</td><td>1284</td>
    <td>1292</td><td>1300</td><td>1308</td><td>1316</td><td>1324</td></tr>
<tr><td>1334</td><td>1344</td><td>1352</td><td>1360</td></tr>
</table></p>

<p><table>
<caption>sprites off, 88 possible slots</caption>
<tr><td>   6</td><td>  14</td><td>  22</td><td>  30</td><td>  38</td>
    <td>  46</td><td>  54</td><td>  62</td><td>  70</td><td>  78</td></tr>
<tr><td>  86</td><td>  94</td><td> 102</td><td> 110</td><td> 118</td>
    <td> 162</td><td> 170</td><td> 182</td><td> 188</td><td> 214</td></tr>
<tr><td> 220</td><td> 246</td><td> 252</td><td> 278</td><td> 310</td>
    <td> 316</td><td> 342</td><td> 348</td><td> 374</td><td> 380</td></tr>
<tr><td> 406</td><td> 438</td><td> 444</td><td> 470</td><td> 476</td>
    <td> 502</td><td> 508</td><td> 534</td><td> 566</td><td> 572</td></tr>
<tr><td> 598</td><td> 604</td><td> 630</td><td> 636</td><td> 662</td>
    <td> 694</td><td> 700</td><td> 726</td><td> 732</td><td> 758</td></tr>
<tr><td> 764</td><td> 790</td><td> 822</td><td> 828</td><td> 854</td>
    <td> 860</td><td> 886</td><td> 892</td><td> 918</td><td> 950</td></tr>
<tr><td> 956</td><td> 982</td><td> 988</td><td>1014</td><td>1020</td>
    <td>1046</td><td>1078</td><td>1084</td><td>1110</td><td>1116</td></tr>
<tr><td>1142</td><td>1148</td><td>1174</td><td>1206</td><td>1212</td>
    <td>1266</td><td>1274</td><td>1282</td><td>1290</td><td>1298</td></tr>
<tr><td>1306</td><td>1314</td><td>1322</td><td>1332</td><td>1342</td>
    <td>1350</td><td>1358</td><td>1366</td></tr>
</table></p>

<p><table>
<caption>sprites on, 31 possible slots</caption>
<tr><td>  28</td><td>  92</td><td> 162</td><td> 170</td><td> 188</td>
    <td> 220</td><td> 252</td><td> 316</td><td> 348</td><td> 380</td></tr>
<tr><td> 444</td><td> 476</td><td> 508</td><td> 572</td><td> 604</td>
    <td> 636</td><td> 700</td><td> 732</td><td> 764</td><td> 828</td></tr>
<tr><td> 860</td><td> 892</td><td> 956</td><td> 988</td><td>1020</td>
    <td>1084</td><td>1116</td><td>1148</td><td>1212</td><td>1264</td></tr>
<tr><td>1330</td></tr>
</table></p>

<p>Note that even in the mode 'screen off', when the VRAM bus is otherwise
mostly idle, the access slots are still at least 8 cycles apart. A single
access takes only 6 cycles, so 2 cycles are wasted.</p>

<p>Very roughly speaking in mode 'screen off' there are about twice as many
access slots as in the mode 'sprites off' and about 5 times as many as in the
mode 'sprites on'. This does however <b>not</b> mean that in these modes the
command engine will execute respectively 2&times; and 5&times; as fast. Instead
in the mode 'sprites on' the speed of command execution is mostly limited by
the amount of available access slots, while in the mode 'screen off', the
bottleneck is mostly the speed of the command engine itself.</p>

<p>Also note that the access slots are not evenly spread in time. For
example:</p>
<ul>
<li>In mode 'screen off', the slots are often only 8 cycles apart (measured
from the start of the 1st to the start of the 2nd slot). Though starting
at cycle=120 there's a gap of 44 cycles.</li>
<li>In mode 'sprites off', during the horizontal border, the access slots are
roughly 8 cycles apart like in the previous mode, but during the display
period, the spacing is more like 26 or 32 cycles. The largest gap is now 54
cycles starting at cycle=1212.</li>
<li>In mode 'sprites on', the pattern is again completely different. Here the
slots are roughly 32 or 64 cycles apart. (The border even has slightly larger
gaps than the display area. So contrary to some speculations, the commands do
not execute faster in the horizontal border in this mode). The largest gap is
now 70 cycles, starting at cycle=92. There's even one location where the
gap is only 8 cycles. (Though if you look at the measurements
you'll see that the slot right after this smallest gap (at cycle=170) is rarely
actually used, even though the command engine is starved for VRAM
bandwidth).</li>
</ul>
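<p>The gap sizes mentioned above can be recomputed from the slot tables. As a
small sketch for the mode 'sprites on' (the slot list is copied from the table
above; the line length of 1368 VDP cycles, used for the wrap to the next line,
is an assumption of this example, and <code>slot_gaps</code> is a made-up
helper name):</p>

```python
LINE_CYCLES = 1368  # assumed length of one display line in VDP cycles

# Access slots in mode 'sprites on', copied from the table.
SPRITES_ON_SLOTS = [
    28, 92, 162, 170, 188, 220, 252, 316, 348, 380,
    444, 476, 508, 572, 604, 636, 700, 732, 764, 828,
    860, 892, 956, 988, 1020, 1084, 1116, 1148, 1212, 1264,
    1330,
]


def slot_gaps(slots, line_cycles=LINE_CYCLES):
    """Return (start_cycle, gap) pairs, including the wrap to the next line."""
    gaps = []
    for i, start in enumerate(slots):
        following = slots[(i + 1) % len(slots)]
        gaps.append((start, (following - start) % line_cycles))
    return gaps


gaps = slot_gaps(SPRITES_ON_SLOTS)
largest = max(gaps, key=lambda g: g[1])
smallest = min(gaps, key=lambda g: g[1])
print("largest gap:", largest, "smallest gap:", smallest)
```

<p>The same helper can be applied to the other two tables to check the 44 and
54 cycle gaps.</p>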

<p>These large gaps between the access slots are important. For example if the
CPU is sending data to the VDP too fast, and this happens right at a
moment where there are no access slots available, then some of the data sent by
the CPU is lost. We'll see later in this text that this can even happen
when the time between the incoming CPU requests is (slightly) larger than the
size of the largest gap.</p>


<h5>allocation of access slots</h5>
<p>The access slots can be used for either CPU or VDP command reads or writes.
This section explains how the slots are allocated to these two subsystems.</p>

<p>The basic principle is very simple: the CPU or the command engine takes the
first available access slot. And when the CPU and command engine both require
an access slot at the same time, then the CPU gets priority. Though if you look
at the details it is a bit more complicated:</p>

<ul>
<li>When the CPU sends a read or write VRAM request to the VDP, this request is
put in a buffer until it can be handled.</li>
<li>When the CPU sends a new request while a previous request is still
pending, the old request is lost. More on this below. <i>TODO most logical
is that the old (not the new) request is lost, but actually check this. Though
the Z80 might be too slow to be able to test this.</i></li>
<li>Similarly when the VDP command engine needs to perform a VRAM read or
write, this request is also put in a buffer. This is a different buffer than
the one for CPU requests.</li>
<li>In contrast to the CPU, the command engine is stalled when the command
engine buffer holds a request. So command engine requests can never get
lost.</li>
<li>16 cycles in advance of an access slot the VDP checks whether there is
either a pending CPU or command request. If there's a pending CPU request, that
request will be executed (16 cycles later). If there's no CPU request but there
is a command request then that one will be executed (16 cycles later). So the
CPU takes priority over the command engine. And very important, if there's no
request pending yet, then 16 cycles later nothing will be executed, not even if
a request does arrive within 16 cycles.</li>
</ul>
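<p>The decision rule from the last bullet can be sketched as a small function.
This is only an illustration of the rule as described here, not code from an
existing emulator. The name <code>allocate_slot</code> is hypothetical, and
whether a request that arrives <i>exactly</i> at the decision moment still
counts is an assumption of this sketch:</p>

```python
DECISION_LEAD = 16  # cycles before a slot at which the VDP decides its use


def allocate_slot(slot_time, cpu_request_time=None, cmd_request_time=None):
    """Decide who gets the access slot that starts at `slot_time`.

    A request time of None means that no request is pending. A request only
    counts if it was already pending at the decision moment, 16 cycles before
    the slot (assumption: arriving exactly at that moment is still in time).
    Returns "cpu", "cmd" or None (slot stays unused).
    """
    decision_time = slot_time - DECISION_LEAD
    if cpu_request_time is not None and cpu_request_time <= decision_time:
        return "cpu"  # the CPU has priority over the command engine
    if cmd_request_time is not None and cmd_request_time <= decision_time:
        return "cmd"
    return None


# A CPU request that arrives 12 cycles before the slot comes too late.
print(allocate_slot(252, cpu_request_time=240))
```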


<h5>CPU access slows down command execution</h5>
<p>A surprising result (at least to me) of these measurements is that the
speed of VDP command execution is reduced while simultaneously doing CPU VRAM
accesses. Looking back this makes sense because the same VRAM access slots are
shared between CPU and command engine and the CPU gets priority.</p>

<p>This effect is clearly noticeable in the mode 'sprites on' but much less in
the other two modes. This is easily explained by looking at the amount of
available access slots in these modes.</p>

<p>The most extreme situation occurs in this test. Execute a HMMV VDP command
(this is the fastest command, see below) while simultaneously executing a long
series of <code>OUT (#98),A</code> instructions (the fastest way to send CPU
write requests). In our measurements, in the mode 'sprites on' the command
execution speed was approximately cut in half! But in the other two modes, the
execution speed was barely influenced. (Actually our test program wasn't
accurate enough to measure any significant speed difference, but theoretically
also in the latter two modes the execution speed should be reduced by a small
amount).</p>


<h5>too fast CPU access</h5>
<p>The fastest way for the Z80 to send read or write VRAM requests to the VDP is
by using a sequence of <code>IN A,(#98)</code> or <code>OUT (#98),A</code>
instructions (of course such a sequence always writes the same value or ignores
all but the last read value). This takes 12 Z80 clock cycles per request.
(Instructions like <code>OUT (C),r</code> or <code>OUTI</code> are all slower).
The VDP is clocked at 6&times; the Z80 speed. So when the Z80 sends multiple
requests to the VDP, the minimal distance between these requests, translated to
VDP cycles, is 12&times;6 = 72 VDP cycles. Earlier we saw that the maximal gap
between access slots was 70 VDP cycles, so at first sight there's no problem.
However consider this scenario:</p>

<ul>
<li>Suppose we're in 'sprites on' mode. At time=236, we're 16 cycles before an
access slot. Suppose there's no pending CPU nor command request at this
time. So nothing will get executed at time=252.</li>
<li>A bit later at time=240 there arrives a CPU write request. This request
gets buffered.</li>
<li>At time=252 there is an access slot, but nothing will get executed in this
slot (because this slot wasn't allocated at time=236).</li>
<li>At time=300 we're again 16 cycles before an access slot. Now there is a
pending CPU request, so we'll execute that at time=316.</li>
<li>At time=312 we receive a new CPU write request. This is 312-240=72 VDP
cycles (or 12 Z80 cycles, the duration of a <code>OUT (#98),A</code>
instruction) after the previous request. But the buffer still contains the
previous unhandled request. The new request overwrites the old request!</li>
<li>At time=316 there's an access slot and we've allocated this slot to the CPU
(at time=300). So the pending CPU request gets executed. Though this writes the
data from the new request, the data from the old request is never written!</li>
</ul>

<p>Note that this scenario used a gap of only 64 VDP cycles between access
slots, while there were 72 cycles between the CPU requests. (And the largest
gap between access slots is 70 cycles).</p>
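<p>The scenario can be replayed with a tiny event simulation. This is a sketch
under the assumptions described in this text (a single-entry CPU request
buffer, a decision 16 cycles before each slot, a new request overwriting a
pending one). The function name <code>simulate_cpu_writes</code> and the
ordering of events that fall on the same cycle are choices made only for this
example:</p>

```python
DECISION_LEAD = 16  # cycles before a slot at which the slot gets allocated


def simulate_cpu_writes(slots, requests):
    """Replay CPU write requests against a list of access slots.

    slots:    sorted start cycles of the access slots
    requests: list of (arrival_cycle, value)
    Returns (written, lost): written is a list of (cycle, value) pairs,
    lost is the list of values that were overwritten in the buffer.
    """
    events = []
    for s in slots:
        events.append((s - DECISION_LEAD, 0, "decide", s))
        events.append((s, 1, "slot", s))
    for t, value in requests:
        events.append((t, 2, "request", value))
    events.sort(key=lambda e: (e[0], e[1]))

    pending = None     # the single-entry CPU request buffer
    allocated = set()  # slots that were granted to the CPU
    written, lost = [], []
    for time, _, kind, arg in events:
        if kind == "decide":
            if pending is not None:
                allocated.add(arg)
        elif kind == "slot":
            if arg in allocated and pending is not None:
                written.append((time, pending))
                pending = None
        else:  # a new CPU request arrives
            if pending is not None:
                lost.append(pending)  # old request gets overwritten
            pending = arg
    return written, lost


slots = [252, 316, 348]  # three consecutive slots in mode 'sprites on'
written, lost = simulate_cpu_writes(slots, [(240, "first"), (312, "second")])
print(written, lost)
```

<p>With the two requests 72 VDP cycles apart, only the second value gets
written (at time=316) and the first one is lost, exactly as in the list
above.</p>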

<!--TODO tests on real machine:
  only lost in 'sprites on' mode ??
     OUT (#99),A -> easy lost
     OUT (C),A   -> only very occasionally
     other OUT patterns always OK
-->

<h2>Command engine timing</h2>

<p>The command engine needs access to VRAM. In the previous section we saw when
the VDP will grant access to this subsystem: when there's an access slot
available and when that slot is not already allocated to CPU access. In this
section we'll see when exactly the command engine will generate VRAM access
requests. Obviously the type (read or write) and the rate of these requests
depend on the type of the VDP command that is executing.</p>

<p>Some commands (like HMMV) only need to write to VRAM. Other commands (like
LMMM) need 2 reads and 1 write per pixel. Many commands execute on a block (a
rectangle) of pixels. Such a block is executed line per line (all pixels within
one horizontal line are processed before moving to the next line). Moving from
one line to the next takes some amount of time (but YMMM is an exception, see
below). This means that e.g. a HMMM command on a 20x4 rectangle executes faster
than on a 4x20 rectangle (same amount of pixels in both cases, but a different
rectangle shape).</p>

<p>The following table summarizes the timing for all measured commands:</p>
<table>
<tr><th>Command</th><th>Per pixel</th><th>Per line</th></tr>
<tr><td>HMMV</td><td>48 W          </td><td>56</td></tr>
<tr><td>YMMM</td><td>40 R 24 W     </td><td>0 </td></tr>
<tr><td>HMMM</td><td>64 R 24 W     </td><td>64</td></tr>
<tr><td>LMMV</td><td>72 R 24 W     </td><td>64</td></tr>
<tr><td>LMMM</td><td>64 R 32 R 24 W</td><td>64</td></tr>
<tr><td>LINE</td><td>88 R 24 W     </td><td>32</td></tr>
</table>
<p><i>TODO timing for PSET, POINT, SRCH</i></p>

<p>I'll explain the notation in this table with an example. Take the LMMM
command:</p>
<ul>
<li>Per pixel the LMMM command needs to:
  <ul><li>Read a byte from the source.</li>
      <li>Read a byte from the destination</li>
      <li>Calculate the result: extract the pixel value from source and
      destination, combine the two (possibly with a logical operation), insert
      the result in the destination byte. And write the result back to the
      destination.</li>
  </ul></li>
<li>So per pixel, the LMMM command will generate 3 VRAM accesses: 2 read
followed by one write. Between these accesses there will be some amount of
time.</li>
<li>For LMMM the table lists '64 R 32 R 24 W'. Let's start at the 1st 'R'
character: this represents the 1st read. Next there's the value 32 and a 2nd
'R', this means that the 2nd read comes <i>at least</i> 32 cycles after the 1st
read. Then there's '24 W', meaning there are <i>at least</i> 24 cycles between
the 2nd read and the write. And the initial value '64' means that there are
<i>at least</i> 64 cycles between the write and the 1st read for the next
pixel.</li>
<li>When moving from one horizontal line to the next in a block command, there
is some extra delay. For the LMMM command this takes 64 extra cycles. So
64+64=128 cycles from the last write of a line till the first read of the next
line.</li>
<li>Note that all these values are the <i>optimal</i> timing values. The actual
delay can be longer because there is no access slot available or the slot is
already allocated for CPU access.</li>
</ul>
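<p>Using this notation, the <i>optimal</i> duration of a block command can be
estimated. The sketch below sums the per-pixel delays and adds the per-line
overhead once per line. That the overhead is also paid for the first line is an
assumption (it was not measured, see the TODO further below), and the names
<code>TIMING</code> and <code>min_block_cycles</code> are made up for this
example:</p>

```python
# command: (per-pixel delays in table order, per-line overhead), VDP cycles
TIMING = {
    "HMMV": ([48], 56),
    "YMMM": ([40, 24], 0),
    "HMMM": ([64, 24], 64),
    "LMMV": ([72, 24], 64),
    "LMMM": ([64, 32, 24], 64),
}


def min_block_cycles(command, steps_per_line, lines):
    """Lower bound on the duration of a block command, in VDP cycles.

    steps_per_line: number of 'per pixel' iterations in one horizontal line
    lines:          number of horizontal lines in the block
    Waiting for access slots or for CPU accesses is not included.
    """
    per_pixel, per_line = TIMING[command]
    return lines * (steps_per_line * sum(per_pixel) + per_line)


# Same number of iterations, different rectangle shape.
print(min_block_cycles("HMMM", 20, 4), min_block_cycles("HMMM", 4, 20))
```

<p>For HMMM this gives 7296 cycles for the 20x4 shape and 8320 cycles for the
4x20 shape, which illustrates why the wider rectangle executes faster.</p>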

<p>All the commands in the table above are block commands except for 'LINE'.
For the LINE command the meaning of the columns 'Per pixel' and 'Per line' may
not be immediately clear:</p>
<ul>
<li>The VDP uses the <a href="http://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm">
Bresenham algorithm</a> to calculate which pixels are part of the line.</li>
<li>This algorithm takes at each iteration one step in the <i>major</i>
direction. The timings for such an iteration are written in the 'Per pixel'
column for the LINE command.</li>
<li>Depending on the slope of the line, in some iterations the Bresenham
algorithm also takes a step in the <i>minor</i> direction. For the VDP such a
minor step takes some extra time (32 cycles). This is written in the 'Per line'
column of the LINE command. (If you look back at the very beginning of this
text, these major and minor steps explain the general octagonal shapes in the
images. The uneven distribution of the access slots explains the
irregularities.)</li>
</ul>
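<p>To make the major/minor step costs concrete, here is a sketch that walks a
textbook integer Bresenham loop and lists the optimal cost of each iteration.
The error-term initialisation and the number of iterations (one per major
step) are assumptions of this example and not necessarily what the VDP does
internally. <code>line_step_costs</code> is a hypothetical name:</p>

```python
LINE_PER_MAJOR = 88 + 24  # '88 R 24 W' from the table, per iteration
LINE_PER_MINOR = 32       # extra cycles for a step in the minor direction


def line_step_costs(dx, dy):
    """Optimal cost (VDP cycles) of each Bresenham iteration of a line."""
    major = max(abs(dx), abs(dy))
    minor = min(abs(dx), abs(dy))
    error = major // 2
    costs = []
    for _ in range(major):
        cost = LINE_PER_MAJOR
        error -= minor
        if error < 0:  # time to also take a step in the minor direction
            error += major
            cost += LINE_PER_MINOR
        costs.append(cost)
    return costs


print(line_step_costs(4, 2))
print(sum(line_step_costs(100, 0)), sum(line_step_costs(100, 100)))
```

<p>In this model a horizontal or vertical line of 100 steps takes 11200
cycles, while a diagonal line of 100 steps takes 14400 cycles.</p>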

<p>Note that for the YMMM command there's no extra overhead when going from one
horizontal line to the next. This might be related to the fact that a line of
a YMMM command always starts at the left or right border of the screen.</p>

<p><i>TODO What we didn't measure (also couldn't measure with our test setup)
was the delay between the start of the command (when the CPU sends the command
byte to the VDP) and the moment the command actually starts executing (e.g.
when the first read or write command access is sent to VRAM). It's logical to
assume that the 'per line' overhead also occurs at the start of the command.
But it's possible there is also some additional 'per command' overhead.</i></p>

<h5>speculation on the slowness of the command engine</h5>
<p>When looking at the above table, we see that the command engine is very
slow. For example in a HMMM command there are 24 cycles between reading a byte
and writing that byte to the new location. Or in a LINE command it takes 32
cycles to take a step in the minor direction. I <i>believe</i> there are two
main reasons for this slowness:</p>
<ul>
<li>I believe that internally the VDP command engine subsystem runs at 1/8 of
the master VDP clock frequency. This matches the observation that all values in
the above table are multiples of 8. It would also explain why the access slots
in mode 'screen off' are at least 8 cycles apart (while a VRAM access only
requires 6 cycles).</li>
<li>The command engine gets stalled whenever there's a pending command engine
VRAM request. A VRAM request (CPU or command) only gets handled after it's been
pending for at least 16 cycles. So combined this means the command engine gets
stalled for 16 cycles on every VRAM request it makes. (Note that especially
this point is just speculation).</li>
</ul>

<p>Taking these two points into account, the above table can be rewritten
as:</p>
<table>
<tr><th>Command</th><th>Per pixel</th><th>Per line</th></tr>
<tr><td>HMMV</td><td>(4&times;8+16) W          </td><td>7&times;8</td></tr>
<tr><td>YMMM</td><td>(3&times;8+16) R (1&times;8+16) W     </td><td>0&times;8</td></tr>
<tr><td>HMMM</td><td>(6&times;8+16) R (1&times;8+16) W     </td><td>8&times;8</td></tr>
<tr><td>LMMV</td><td>(7&times;8+16) R (1&times;8+16) W     </td><td>8&times;8</td></tr>
<tr><td>LMMM</td><td>(6&times;8+16) R (2&times;8+16) R (1&times;8+16) W</td><td>8&times;8</td></tr>
<tr><td>LINE</td><td>(9&times;8+16) R (1&times;8+16) W     </td><td>4&times;8</td></tr>
</table>

<p>When you look at the data in this way, the numbers already look more
reasonable.</p>
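<p>The rewritten per-pixel values can be double-checked mechanically. The
sketch below splits each measured delay into the speculated 16-cycle stall plus
a whole number of 8-cycle command engine ticks. Both constants are the
speculation from this section, not established facts, and
<code>engine_ticks</code> is a made-up helper name:</p>

```python
CLOCK_DIVIDER = 8   # speculated: command engine runs at 1/8 of the VDP clock
STALL_CYCLES = 16   # speculated: stall per VRAM request

# Measured 'per pixel' delays, copied from the first timing table.
MEASURED_PER_PIXEL = {
    "HMMV": [48],
    "YMMM": [40, 24],
    "HMMM": [64, 24],
    "LMMV": [72, 24],
    "LMMM": [64, 32, 24],
    "LINE": [88, 24],
}


def engine_ticks(delay):
    """Split a measured delay into n*8+16 and return n."""
    ticks, remainder = divmod(delay - STALL_CYCLES, CLOCK_DIVIDER)
    if remainder != 0 or ticks < 0:
        raise ValueError("delay %d does not fit the n*8+16 model" % delay)
    return ticks


ticks_per_command = {
    name: [engine_ticks(d) for d in delays]
    for name, delays in MEASURED_PER_PIXEL.items()
}
for name, ticks in ticks_per_command.items():
    print(name, ticks)
```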



<h2>Next steps</h2>

<p>All the information above <i>should</i> already be enough to significantly
improve the accuracy of MSX emulators. In the coming months I plan to work on
improving openMSX.</p>
<ul>
<li>First I'd like to improve the CPU-VRAM access stuff, so that e.g. too fast
CPU accesses actually result in dropped requests.</li>
<li>Next step is the timing of the VDP commands. This depends on the previous
step because e.g. CPU access slows down command execution.</li>
<li>A still later step could be to fetch the data required for display
rendering (bitmap, sprites) at more accurate moments in time. This is lower
priority because:
 <ul>
  <li>These effects are limited to the visual output. Errors can't influence
  the 'state' of the MSX machine. So it's impossible to write an MSX program
  that checks (= makes a decision based on) the rendering accuracy. (OTOH it is
  possible to check for dropped CPU requests or the speed of the
  commands).</li>
  <li>I don't know of any <i>existing</i> MSX software where this will make a
  noticeable difference. Maybe an idea for a <i>new</i> test is to vary the
  y-coordinates of the sprite(s) within one display line. Thus causing the
  sprite engine to use two different values in the two phases of sprite
  rendering.</li>
  <li>Hmm &hellip; or maybe there is an existing program: the <a
  href="http://users.skynet.be/bk263586/verti.zip">verti</a> demo. On current
  emulators the vertical bars are all equally wide. But on a real MSX there
  are wider and narrower bars, but all are multiples of 8 pixels.</li>
 </ul>
</ul>
<p>I'm afraid this will all still take quite a bit of work.</p>

<p>Anyway, I hope the information in this document is useful for (other) MSX
emulator developers or for MSX developers in general.</p>

<hr/>
<p align="right" style="font-size:smaller;">
2013/03/30, Wouter Vermaelen
</p>

</body>
</html>
